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Abstract 



Predictability of undesired events is a question of great interest in many scientific disciplines including 
seismology, economy, and epidemiology. Here, we focus on the predictability of invasion of a broad 
class of epidemics caused by diseases that lead to permanent immunity of infected hosts after recovery 
or death. We approach the problem from the perspective of the science of complexity by proposing 
and testing several strategies for the estimation of important characteristics of epidemics, such as the 
probability of invasion. Our results suggest that parsimonious approximate methodologies may lead 
to the most reliable and robust predictions. The proposed methodologies are first applied to analysis of 
experimentally observed epidemics: invasion of the fungal plant pathogen Rhizoctonia solani in replicated 
host microcosms. We then consider numerical experiments of the SIR (susceptible-infected-removed) 
model to investigate the performance of the proposed methods in further detail. The suggested framework 
can be used as a valuable tool for quick assessment of epidemic threat at the stage when epidemics 
only start developing. Moreover, our work amplifies the significance of the small-scale and finite-time 
microcosm realizations of epidemics revealing their predictive power. 

>: 

1 Introduction 

oo 

Predictability of catastrophic events such as earthquakes, epidemics, fracture or financial crashes [IHS] is a 
topic of increasing interdisciplinary interest. The predictability of these events is inextricably linked to the 
inherent complexity of the phenomena under consideration 1,2 . Here, we focus on epidemiology. Within 
\ this context, many studies have been devoted to prediction of the temporal incidence of epidemics (i.e. 

the evolution of the number of infected hosts in the course of an epidemic) [IHH] . Recently, an increasing 
number of papers have also considered the prediction of the spatio-temporal evolution of epidemics [9lll2j . 

Both the temporal and the spatio-temporal incidence depend on complex factors related to the trans- 
mission of infection and the properties of the hosts. For instance, the hosts are not identical in susceptibil- 
ity and transmissibility of infection due to difference in age, size, genotype and neighbourhood. [5,9,10, 13 . 
The transmission of infection is stochastic, meaning that a healthy host is infected by contact with in- 
oculum from an infected host with a certain probability only. Many epidemics, notably those involving 
transmission by invertebrate vectors, or by wind and rain for many plant pathogens, are subject to vari- 
ability in weather. This environmental stochasticity can also influence the evolution of epidemics in such 
heterogeneous systems [2] . All these factors make prediction of disease incidence an extremely challeng- 
ing and sometimes controversial task [31[T51|TB] . In Ref. [I], it was suggested that, although obtaining 
precise quantitative predictions for the incidence would be obviously desirable, qualitative predictions 
may be more valuable. This is very much along the lines of ideas from the science of complexity claim- 
ing that, despite the fact that giving accurate predictions for the detailed evolution of complex systems 
might be an illusory task, certain qualitative features of the evolution, such as occurrence or absence of 
a catastrophic event, could be more amenable for prediction [2 UT71IT8] . 

Here, we address the question of predictability of epidemics using a methodological framework inspired 
by the science of complexity. The main aim is to estimate the probability that an emerging epidemic will 
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invade a significant fraction of the population in the future. This quantity can be viewed as a qualitative 
feature of the complete spatio-temporal evolution of epidemics. 

We propose several methods for approaching the problem that offer different levels of precision. Our 
results suggest that the most precise methods do not necessarily lead to more reliable predictions. In- 
stead, parsimony seems to be the key ingredient for prediction based on inherently limited observations. 
The framework presented below deals with epidemics caused by a broad class of pathogens leading to 
permanent immunity of infected hosts after recovery (or death). There are numerous examples of such 
diseases affecting populations of humans [19, 20 , animals [21] and plants [22]. The advantage in analysis 
of such epidemics is that they are characterised by a well-defined final state consisting of only hosts that 
were never infected and hosts that were infected and became immune. In particular, we focus on the 
estimation of the probability that an epidemic is invasive in the final state. 

The proposed methods are first applied to prediction of invasion of a pathogen in an experimental 
model system in which the fungal plant pathogen, Rhizoctonia solani, spreads through a population of 
hosts represented by discrete nutrient sites. The properties of the sites, e.g. nutrient concentration, can be 
varied for different realisations of epidemics. Such a system is convenient for generation and observation 
of rapid, highly replicated and repeatable epidemics and it is used as a benchmark for description of our 
methodologies and analyses. The epidemic prediction analysis for experimental system is followed by 
a test of our methods in numerical experiments for epidemics spreading on networks of hosts arranged 
on a regular lattice. The advantage of investigating such epidemics is that their properties are known 
beforehand and this allows us to provide a precise analysis of the performance of prediction methods by 
comparing with the expected behaviour. 

2 Methods 

The methods follow four steps that are basic for any scientifically meaningful prediction of the behaviour 
of a complex system: (i) Observation of the initial evolution of the process over a certain period of time, 
iobs- (h) Construction of a model for description of the observed behaviour, (iii) Fitting the model to the 
data obtained from observations, (iv) Extrapolation of the behaviour to the future by using the model 
with the fitted parameters. These steps are interconnected and we argue that they should be kept at a 
similar level of complexity in order to make their interplay as consistent as possible. In this paper, we 
investigate whether or not such consistency is important for obtaining reliable predictions by exploring 
several combinations of strategies for steps (i)-(iii) (see summary in Table [1}. 

For concreteness, we illustrate our prediction methods for the particular case of fungal colony invasion 
in microcosms comprising populations of nutrient sites (agar dots) |23| . In these experiments, the central 
agar dot in ensembles with the geometry shown in Fig. [T] was inoculated by the soil-borne fungal plant 
pathogen R. solani and the spread of the fungal colony is scored in discrete time steps (e.g. daily). 
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Table 1. Overview of methods for prediction. Combination of possible data sets considered in step (i) and methods used for 
addressing steps (ii) and (iii) in the prediction process. The first column introduces a label for each set of methods, which are ordered 
according to their overall expected precision. The second column gives the format of the data obtained from observations in step (i) . 
The smallest precision of the data corresponds to methods A and B in which only the incidence, C(t), is used. Methods C-E use a 
limited spatio-temporal knowledge of the evolution of the infection given by the shell-evolution function F(l,t). Method F uses the time 
of infection of each host, {ii}, which is typically unknown from observations but can be inferred in step (iii). In step (ii), RF and CT are 
abbreviations for the Reed- Frost and continuous-time model, respectively. In step (iii), we have used several methods for fitting: 
Minimum Distance (MD), Approximate Bayesian Computation (ABC), and data-augmented Markov Chain Monte Carlo (DA-MCMC). 
Column five lists the parameters involved in each step for prediction of the fungal invasion in the agar-dot experiment. Methods based 
on the RF dynamics are parametrised by the transmissibility, T, and a time scale r exp . The CT dynamics used in methods E and F is 
parametrised by T, a characteristic time To, and a shape parameter for the time-dependence of the transmission of infection, k. The RF 
dynamics corresponds to the limit k — > oo of the CT dynamics. The MD method for fitting consists in minimising the parameter d 2 (last 
column) which measures the difference between observations and numerical simulations. The ABC fitting procedure assumes that a 
simulated invasion fits well the observed epidemic if d 2 < e, where e is a free parameter. As shown in the last column, the definition of 
d 2 depends on the descriptors for observations used in step (i). For methods A and B, d 2 is defined in terms of the observed and 
simulated incidences, c b s (t) — C b s (t)/N and c s ; m (i) = C s i m (t)/N, normalised to the number of hosts in the population, N. In methods 
C-E, d 2 is defined in terms of the shell-evolution function for observations and simulations. Sections I and II of SI Appendix give more 
details on the definition of models used in step (ii) and fitting methods used in step (iii). 
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In the following, we give a general description of the steps for prediction of epidemic invasion with 
particular assumptions suitable for the analysis of the fungal invasion experiment. The main details of all 
the methodologies are summarised in Table [TJ The methodology C is mainly used for illustration of our 
concepts. Details of other explored methodologies are given in SI Appendix. The motivation for choosing 
methodology C is twofold: (i) it keeps all the steps for prediction of the behaviour at a similar level of 
complexity, as illustrated by analysis of the fungal colony invasion in the agar-dot experiment; (ii) it is 
a parsimonious methodology that leads to predictions that are, at least, as robust as (or, arguably, even 
more robust than) those based on more sophisticated approaches (Table [lj. 

Step (i). The information that can be extracted from observation of the time evolution of epidemics 
is usually limited. In many cases, the only available information is the incidence, C(t) (the number 
of infected hosts), at subsequent observations, and occasionally the spatial location of infected hosts is 
also known, e.g. for epidemics in populations of plants [8,22,24,25 . These limitations have a dramatic 
influence on subsequent steps in the prediction process and it is crucial to identify which quantities 
are sufficient for prediction of the catastrophic event (i.e. the probability of an invasive epidemic in our 
case). As shown in Table[H we consider three types of observations. The first consists of discrete temporal 
observations of C(t) (Methods A and B in Table [1]). The second possibility (Methods C-E) considers 
discrete spatio-temporal observations giving the evolution of infection at discrete times, t, in shells at a 
'chemical distance' I from the initially inoculated host. As explained in Fig. QJa), such observations can 
be properly described in terms of a shell-evolution function F(l,t). As a third possibility (Method F), 
we use a method for data augmentation in step (iii) that infers the unobserved time of infection for each 

host, {u} [smug. 

Step (ii). We describe the evolution of the epidemic in terms of a spatial SIR (susceptible-infected- 
removed) epidemiological model where the hosts can be either susceptible (S), infected (I) or removed 
(R) [T9l EU [22J, EB] . This is a prototype model for a wide class of epidemics where disease leads to 
permanent immunity of hosts after recovery or death. In particular, this paradigm has been shown to be 
appropriate for description of fungal invasion [23,27,28 . In principle, a continuous-time dynamic model 
is necessary to provide a precise description of epidemic evolution characterised by stochasticity in times 
of infection and removal/recovery of hosts. Following this idea, it would be natural to use a model with 
continuous-time (CT) dynamics (described in SI Appendix). The drawback of this approach is that it 
requires knowledge of the precise times of infection of hosts, {U}, which are typically not available from 
discrete spatio-temporal observations in step (i). In order to match an appropriate model with the level 
of detail of observations, we consider the discrete-time dynamics model which reduces the SIR framework 
to the so-called Reed- Frost (RF) model [29]. This simplified description is not expected to capture the 
dynamical details of the evolution of the epidemic but reproduces well its final state [30l[3TJ. This is a 
very important consequence of the fact that, no matter how complicated the evolution of the epidemic is, 
the final state of an epidemic with death of infected individuals or permanent acquired immunity after 
recovery depends only on the probability T, called transmissibility, that the infection has ever been passed 
between each pair of connected hosts (as shown in Fig. [Tfb)). Although the transmissibility is expected 
to exhibit a certain degree of spatial heterogeneity in real epidemics, we make the minimal assumption 
that the trend of the epidemic can be well approximated by a RF process with a homogeneous effective 
transmissibility T. 

Step (iii). The goal of this step is to estimate the values of the parameters of the model used in 
step (ii) that give a good description of the observations. Consider, for definiteness, methodology C in 
Table [JJ Due to factors such as stochasticity and heterogeneity in transmission, a given observed spatio- 
temporal map for infection can occur for different values of the estimated transmissibility, T. However, 
some of these values for T are more likely to produce the observed spatio-temporal pattern than others. 
To account for this, we introduce the probability density function (p.d.f.), p(T), which quantifies the 
probability that the observed spatio-temporal pattern is reproduced by a certain value of T. As shown 
schematically in Fig. [He), P(T) is calculated by generating a large number of stochastic realizations 
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Figure 1. Steps for prediction used in methodology C (Table [T]). (a- left) Discrete spatio-temporal 
observations of evolution (spatio-temporal map) of a hypothetical epidemic spreading from the central 
host in a population of hosts (circles) arranged on a triangular lattice with lattice spacing a and L hosts 
per side of the hexagonal boundaries of the system. The arrangement of nutrient sites in the fungal 
invasion experiment analysed below is of this type. Numbers inside the circles denote the times, 
t = 0, 1, . . ., of infection of hosts by the observation time i b s = 2. Empty circles correspond to healthy 
(susceptible) hosts by the same time. The hexagons with dotted lines indicate the shells of hosts at a 
given chemical distance, I, to the centre of the system, (a- right) The spatio-temporal evolution of the 
epidemic is described by the shell-evolution function F(l, t) giving the relative number of hosts in layer / 
infected by time t. For instance, F(2, 2) =3/12 for the epidemic shown in the (a-left) panel, (b) In step 
(ii), the epidemic is described in terms of an SIR model. An infected host, 0, remains infectious during 
the infectious period r which, for simplicity, is taken as being constant over the whole population and 
set as a unit of time, r = 1. After the infectious period r, the host is removed, (R). During the time r, 
can transmit the infection to a neighbouring susceptible host, (S), with probability T (transmissibility). 
Alternatively, can be removed without passing the infection to (S) with probability 1 — T. In the 
discrete time Reed-Frost (RF) dynamics used in our approach, the infection is passed instantaneously 
from Q to (S) at t = r. (c) In step (hi), the fitting procedure consists in finding the probability density 
function (p.d.f.) p(T) for transmissibilities T (c-left) such that a RF process with T and shell-evolution 
function F s i m (l,t) give a good description of the observed shell-evolution function, F s i m (Z,i) (c-right). 
For visualization clarity in c-right, an example of F s ; m (Z,i) represented by the blue-grid surface does not 
fit well the observed shell-evolution function, F bs(M) (the shaded surface). The p.d.f. p{T) is obtained 
by running many RF epidemics with random transmissibility and minimising the parameter d that 
quantifies the difference between the observed and the RF shell-evolution functions, (d) Once p(T) has 
been obtained (continuous curves in c-left and d-left), the probability of an invasive epidemic, Pi nv (L), 
can be calculated by Eq. (fTJ) which involves the conditional probability of invasion Pj nv (T; L) for any 
given T (dashed line in d-lcft). The value of P lnv (L) is represented graphically in d-right by the area 
under the curve (shaded region) for the function Pj nv (T; L)p(T). 
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of the RF epidemic with transmissibilities T sampled uniformly from the interval [0, 1] and comparing 
their shell-evolution functions, F slWL (l,t) (see caption to Fig. [T] for definition), with the observed one, 
-Fobs Ideally, the distribution p(T) would correspond to the histogram of values for T producing 
shell-evolution functions F S i m (l,t) identical to F ^ s (l,t) but obtaining an exact match is computationally 
very demanding. Moreover, reproducing F b s (l,t) by a RF model is in general impossible in realistic 
epidemics for which time is not discrete. Therefore, we use a minimum distance (MD) algorithm to 
calculate p(T) approximately as the histogram of values of T minimising the quantity d 2 defined in 
Table Q] that measures the distance between F s - lm (l,t) and F \> a (l,t). In this way, the sampled values 
of T reproduce P bs(M) approximately rather than necessarily with distance dj = 0. This approach is 
similar to the Approximate Bayesian Computation (ABC) method that determines p{T) as the histogram 
of values of T for which dj < e, where e is a parameter used in the method [35]. Both ABC and MD 
algorithms give similar results despite the fact that MD does not require the use of an additional parameter 
e. In addition to these approximate methods, we have fitted the spatio-temporal evolution of the CT 
model proposed in step (ii) for comparison by means of a more standard Bayesian procedure using Markov 
chain Monte Carlo (MCMC) method with data augmentation (method F in Tabic [TJ. 

Step (iv). In the final stage of the prediction process, given p(T), we evaluate the probability 
Pinv(L) that the observed epidemic will ever invade a system of size L. The epidemic is defined as being 
invasive if the final cluster of removed hosts has reached at least one node on each of the six edges of 
the system. Otherwise, the epidemic is classified as being non-invasive. The conditional probability of 
invasion Pj nv (T;L) in a system of size L by an SIR process with a given transmissibility, T, can be 
calculated numerically by running many stochastic realisations of the epidemic and counting the fraction 
of invading events. As shown in Fig.[TJd), Pi nv (T; L) exhibits a sigmoidal dependence on T which indicates 
a non- invasive (invasive) regime of epidemics for relatively small (large) values of T. Once P; nv (T; L) and 
p(T) are known, the estimated probability of invasion can be calculated as follows: 

i 3 i„v(i)=/ P inv (T;L)p(T)dT . (I) 
Jo 

This formula defines the probability that the invasion occurs given our knowledge about the effective 
transmissibility encoded by p{T) (see a simple graphical interpretation in terms of the shaded area in 
Fig-Hid)). Importantly, Eq. (TJ gives an extrapolation of the behaviour of the epidemic to its final state 
without necessarily providing a detailed description of the actual evolution leading to such a state. 

3 Application to fungal invasion 

In the fungal invasion experiments, the spatio-temporal maps of infected agar dots were scored daily 
over 21 days (see two typical patches of colonisation after 21 days in Fig. HJa)). The transmissibility 
in this experiment corresponds to the probability of fungal colonisation between two adjacent agar dots 
and it was controlled by variable lattice spacing, a = 8, 10, 12, 14, 16, 18 mm. Clearly, the experimental 
setup is restricted both in space and time. Our aim is to use these limited observations to estimate the 
probability of invasive epidemics in larger systems and for longer times. The analysis is performed for 
each individual realisation of the experiment (6 replicates per value of a) . 

In order to make a proper comparison between the experimental observations with the RF model 
used in methods A-D, it is necessary to rescale the time step of the RF dynamics with dimensionlcss 
t = 1 to r eX p measured in days. The value of r eX p is not known and it is treated at the same level as the 
transmissibility. More explicitly, we deal with a bi-variate probability density function, P2{T, r eX p), which 
can be determined for each epidemic with a simple extension of the methods explained in Sec. [2] (step (iii)) 
for obtaining p(T). The estimated Pj nv is obtained from Eq. (fTJ) by defining p(T) as the marginal p.d.f., 
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Figure 2. Fungal invasion in the system of agar dots placed on a triangular lattice, (a) The observed 
mean probability of invasion P cxp obtained by counting the relative number of invasive epidemics after 
21 days is shown by the continuous line. The probability of invasion after 21 days was estimated for each 
replicate by observing the initial evolution of colonisation during t b s — 10 days. The corresponding 
mean over replicates with the same value of a is shown with a different symbol type (the same as in 
Figs. |U[5]) for each method for prediction. The inserts show the invasive (left) and non-invasive (right) 
state of the epidemic after 21 days for two representative replicates with lattice spacings a = 10 mm and 
a = 14 mm, as marked by arrows. Solid (open) circles in the inserts represent colonised (not colonized) 
dots, (b) Prediction of P- lnv with different methodologies for individual replicates of the epidemic in a 
large system of size L — 51 obtained from observations during t Q \, s — 21 days of the smaller 
experimental system (L < 8). The mean of P lm over replicates for each value of a is shown by different 
symbol types corresponding to different methodologies. The mean probability of invasion P exp obtained 
by counting the relative number of invasive epidemics after 21 days is shown by the continuous line. 
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Figure 3. Estimates of the transmissibility for fungal invasion in the system of agar dots. The p.d.f.'s 
p(T) obtained with different fitting methodologies are plotted for the fungal colony invasion in a 
population of agar dots with lattice spacing a — 10 mm. Estimates correspond to observation of the 
fungal spread during i b s = 21 days. Different symbol types correspond to different fitting 
methodologies, as marked in the legend. Fig. SI in SI Appendix shows similar plots for six replicates of 
the system. 

p(T) = p2(T, r e xp)drexp. As explained in more detail in Sec. I of SI Appendix and summarised in 
Table[T] the continuous-time SIR model used in methods E and F involves three parameters: T,to, and k. 
The fitting of the data results in a p.d.f. ps(T, To, k) from which we obtain p(T) = J pa(T, To, fc)drodfc. 
The probability of invasion is then calculated from Eq. (fTJ) in the same way as for methods A-D. 

3.1 Uncertainty of the estimated transmissibility 

The functions p(T) obtained for the fungal invasion experiments typically exhibit a pronounced peak (see 
the results for one replicate in Fig. [3] and similar results for more replicates in Fig. SI of ST Appendix). 
The peaked shape of p(T) suggests that T can be suitably described in terms of its mean value (T) 
and standard deviation ot = ((f 2 ) - (T) 2 ) 1/2 - For each methodology, Fig. H shows the average over 
replicates of (T) and cry as a function of the lattice spacing. These estimates correspond to observations 
of the evolution of infection during t b s — 21 days. All the methods give similar values for (T) which 
have a clear and expected tendency to decrease with increasing a. The uncertainty in T, quantified by 
<7t, exhibits greater variations between methodologies but it takes values that are smaller than (T) for 
all the methods and lattice spacings (Fig. BJb)). This means that (T) is a good measure of the typical 
value of the transmissibility. However, the value of (T) on its own does not necessarily provide a good 
approximation for P; nv because the width of p(T) can bring a significant contribution to the integral in 
Eq. (fTJ. This is explicitly shown in Sec.|U 

Comparison of (T), ar, and p(T) for different methods leads to the following conclusions: 

(1) Given a level of description (step (i)) and a model (step (ii)), the estimates of the transmissibility 

obtained with ABC and MD methods are, in general, in good agreement (cf. method A with method 
B, and method C with method D in Figs. [3] and 

(2) Given a level of description (step (i)) and an estimation method (step (iii)), the posteriors obtained 
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Figure 4. Statistical characteristics of the estimates of the transmissibility for fungal invasion in the 
system of agar dots. Dependence on the lattice spacing of (a) the mean value (T), and (b) standard 
deviation ut of the transmissibility calculated from the p.d.f. p{T) corresponding to observations 
during i bs = 21 days. For clarity, each symbol gives the average of (a) (T) and (b) ar over 6 replicates 
of the experiments for each lattice spacing, a. As marked in the legend, different symbol types 
correspond to different methods for addressing the steps (i)-(iii) summarised in Table [TJ 



using discrete- and continuous-time models are in reasonable agreement (cf. method D with method 
E in Figs. [3] and S]). The only difference is a trend for p{T) corresponding to CT dynamics to have 
a "heavy tail" for large values of T (see the replicate 4 in Fig. SI, SI Appendix). Large values of T 
are correlated with large values of the time scale r (i.e., slower processes with high T) and small 
values of the shape parameter k, that are ruled out by the RF model. This effect becomes more 
important for larger values of the lattice spacing, as indicated by the large values of (T) and <7t 
corresponding to method E (asterisks in Fig. [4]). 

(3) The estimates from augmented-data MCMC (methodology F) are in general different from those 
obtained by other methods. Moreover, the p.d.f. 's p{T) obtained with MCMC show no systematic 
trend with respect to the other methods. With respect to, e.g., the p(T) obtained with MD, they 
can be located at slightly higher (replicates 5 and 6 in Fig. SI) or lower (replicates 2 and 3) values 
of T, or approximately at the same value (replicates 1 and 4). Moreover, the variation in the peak 
position of p(T) between different replicates is larger than for the other methods. This suggests 
that the MCMC method is more sensitive to fine details of the evolution of the epidemic. A possible 
explanation is that augmented-data MCMC involves the inference of the unobserved colonisation 
times and thus is intrinsically individual-based, in contrast to shell-based (or mean-field) methods, 
which try to match the colonisation times in an approximate manner only. 

3.2 Comparison of fitted models with experimental data 

In order to assess the quality of the assumptions used for estimation, we compare the fitted models with 
the available experimental data. For methods A-D, we compare the incidence and shell-evolution function 
obtained numerically (RF dynamics) with values for T and r eX p sampled from p2(T,T exp ) estimated by 
means of the spatio-temporal maps at maximum observation time t b s = 21 days with the actual incidence 
and shell-evolution function for each epidemic. Similarly, the fits from methods E and F are compared 
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with experimental observations by running numerical epidemics with parameters for the CT dynamics 
sampled from the p.d.f. psiT, to, k). We make a quantitative comparison based on squared distances d\ 
and d 2 j: (cf. Table [1} between simulated epidemics and experimental fungal invasions. More explicitly, we 
define the root mean square (rms) distances, 

' dl n 1/2 



Ac ={tt) • ( 2 ) 

( A \ 1/2 

where At — 21 days is the time interval used for calculations of d\ or df. The quantity l mayi is the 
maximum chemical distance to the centre of the system of agar dots. Its value decreases with the lattice 
spacing and ranges from Z max = 2 for a = 16 mm to Z max = 8 for a = 8 mm [23j . From the definition 
of d\ given in Table [TJ it is easy to see that Ac gives the typical deviation of the simulated incidence 
per unit host at a given time, c s ; m (i), from the observed incidence per host at the same time, c bs(i)- 
Similarly, AF gives the typical deviation of the simulated shell-evolution function, F S j m , evaluated at any 
spatio-temporal coordinates (l,t) from the observed value at the same coordinates. 

Fig. [5] shows the mean of the rms distances obtained by averaging over stochastic simulations and 
over replicates with given lattice spacing. The low values of the rms distances (Ac < 0.2 and AF < 0.3) 
indicate that the observed C(t) and F(l,t) are statistically well described by the fitted models. For 
any given method and lattice spacing, we obtained Ac < AF, which is expected because reproducing 
the spatio-temporal evolution represented by F(l,t) is more demanding than capturing the temporal 
evolution of the colonisation given by c(t). Both Ac and AF tend to be larger for a around 10-12 mm 
which, as shown below, corresponds to cases that are close to the invasion threshold (i.e. where P{ nv 
decreases from 1 to on increasing a, as shown by the continuous line in Fig. (2fa)). Variability between 
replicates of epidemics with given T is larger around the invasion threshold, which is associated with a 
critical phase transition and characterised by large fluctuations 13, 21,23,26 28 . As a consequence, the 
quality of fits is lower in the vicinity of the invasion threshold and this leads to larger values of Ac and 
AF. 

Methodology C gives a good balance between performance and number of parameters involved. Meth- 
ods C-E based on an approximate spatio-temporal description of epidemics given by F(l, t) result in more 
accurate predictions than methods A (squares in Fig. [5]) and B (diamonds) that neglect spatial features of 
invasion. Moreover, the approximate methods C-E also perform better than even methodology F (trian- 
gles in Fig. [5]) despite the fact that the latter aims for a more precise spatio-temporal description. A more 
qualitative and visual comparison of estimated and observed C(t) reveals similar differences between all 
the methodologies (see details in Sec. Ill of SI Appendix). 



3.3 Two applications for prediction methods 

As a first application of the proposed methods, we have studied the predictive power of the estimates 
of the probability of invasion and the incidence by calculating p(T) from the early stages of the actual 
epidemic, i.e. for t ^ s < 21 days. In particular, based on the estimated p(T) for i b s = 10 days, we 
have obtained estimates for the probability of invasion at time t = 21 days (squares and dashed line in 
Fig- Ufa)) and compared them with the probability P cxp of invasion at t = 21 days obtained directly from 
the experimental data. The observed probability of invasion, F cxp , is estimated by counting (for each a) 
the fraction of replicates in which the fungus has reached the six outer edges of the experimental system 
by 21 days. The mean of F inv averaged over replicates with the same value of a (symbols in Fig. [2Ja)) 
gives a reasonable estimate for the observed mean of P exp (solid curve in Fig. HJa)) after t — 21 days 
for most of the methods. Overall, the best predictions for Pi nv are obtained with methodology C (solid 
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Figure 5. Comparison of fitted models with experimental data. Mean rms distances (a) Ac and (b) 
AF between observed and simulated fungal invasions in populations of agar dots. The mean value of 
rms distances is obtained by averaging over 10 4 stochastic realisations of simulations and over 6 
replicates for each lattice spacing, a. Simulations are based on fits to observations over t b s = 21 days. 
Different symbols correspond to different methodologies summarised in Table [TJ as marked in the legend. 



circles in Fig. [2a))- As expected, P nv decreases with increasing a for all methodologies, illustrating the 
existence of the threshold for epidemics around a c± 12 mm [23) . Similarly, the experimentally observed 
incidence and shell-evolution function are statistically well captured by the numerical extrapolation for 
their simulated counterparts up to time t = 21 days obtained from observations over times i b s < 21 days. 
A visual illustration of the agreement between the observed and predicted incidence is given in Sec. Ill 
of Appendix SI. Fig. |6] shows the rms distances between the observed and predicted evolutions from day 
11 to day 21. The time interval used to calculate Ac and AF from Eqs. (5) and ([3j) is At = 11 days. The 
relative trends of Ac and AF between methods and lattice spacings are similar to those reported in Fig. [5] 
for the comparison between observations and numerical simulations with i b s - The main difference is that 
the values of the rms distances corresponding to predictions of the evolution of colonisation (Fig. [6]) are 
systematically larger than those obtained by simply comparing observed evolutions with their respective 
fittings (Fig. [5|. This is in agreement with the intuitive idea that predicting the a priori unknown 
evolution of a system is more challenging than reproducing a fitted evolution. 

As a second application of our methodology, we have calculated P nv at the end of the epidemic as 
a function of the lattice spacing in systems of size L = 51, i.e., larger than the experimental samples 
of sizes L — 2, ...,8 which decrease with increasing lattice spacing (see the two populations for different 
value of a shown in Fig. [SJa)). Such predictions are based on estimates for the transmissibility obtained 
from observations up to day £ bs- As expected, p nv decreases with increasing a, illustrating the existence 
of the threshold for epidemics around a ~ 12 mm [53]. The results of applying each of the prediction 
methods are shown in Fig. [21b) for the mean probability averaged over replicates for each value of a. All 
the methods except E give similar predictions for P n v The large values of P nv predicted by method 
E are a consequence of the "heavy tail" of the p.d.f. p(T) which gives a significant weigh to the high 
values of P nv for large T in Eq. (TjQ) . The dependence of P nv on a differs from the observed probability of 
invasion, P eX p (continuous curve in Fig.[2lb)). The difference can be qualitatively understood by recalling 
that P nv gives an extrapolation both in space and time. Indeed, P nv > P cxp because some epidemics 
that are non-invasive after 21 days have a certain probability to invade a system of size L = 51 for 
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Figure 6. Comparison of the predicted and observed evolution of fungal invasion in the system of agar 
dots. Predictions of the fungal spread in the period of time between days 10 and 21 are made from 
observation of the initial spread during i Q b s = 10 days. The vertical axes show the mean over replicates 
of the rms distances (a) Ac and (b) AF between observed and predicted fungal evolution during the 
time interval 11 — 21 days. The mean value of rms distances is obtained by averaging over stochastic 
realisations of simulations and over 6 replicates for each lattice spacing, a. Simulations are based on 
fittings to observations over t b s = 10 days. Different symbols correspond to different methodologies 
summarised in Table [TJ as marked in the legend. 



t > 21 days. In addition, both infectivity and susceptibility are expected to be subject to heterogeneity 
in the agar-dot system due to, e.g., inherent variability. Based on the results presented in Sec. 0] for 
numerical experiments with heterogeneity in transmission, we expect the estimated Pi nv to give an upper 
bound to the actual probability of invasion. This can also contribute to the difference between P; nv and 
P e xp for large values of a. 

4 Numerical experiment 

The quality of the predictions presented in the previous section is influenced by the quality of the ob- 
servations in step (i), the suitability of the model chosen in step (ii) for description of the data, and 
the fitting procedure used in step (iii). In principle, the effect of these factors on predictions could be 
minimised by optimising the procedures used in each step for prediction. Stochasticity associated with 
the transmission of infection also influences the ability of making reliable predictions. In contrast to the 
previous factors, stochasticity is inherent to the nature of the system and its negative effect on predictions 
cannot be minimised without modifying the system. In this section, we present a sensitivity analysis of 
our methods by applying them to prediction of invasion for numerically-simulated epidemics where the 
main factor compromising predictability is the intrinsic stochasticity in transmission of infection. The 
advantage in this case with respect to more realistic situations is that both the transmissibility, T, and 
the probability of invasion, P{ nv (T; L), are known and it is then possible to investigate the performance 
of the estimates for p{T) and P- lm {L) by comparing with the known quantities. 

We first consider the simplest situation when the observed epidemics follow the RF dynamics with 
homogeneous transmission (i.e. T is the same for all pairs of nearest neighbours in the population). The 
idea is to run numerical experiments with known T, observe the evolution of the epidemic over an initial 
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Figure 7. Numerical experiments of SIR epidemics with homogeneous transmissibility. 

Hosts are placed on the nodes of a triangular lattice of size L — 51 (cf. Fig.QJa)). The evolution of 
epidemics starting from the central host is observed over time t < t Q ^ s = It. The line marked by circles 
shows the dependence of the conditional probability of invasion, Pj nv (T; L), on transmissibility obtained 
by the simulations in the system of size L = 51. The shaded region shows the levels of confidence in 
percentage of the p.d.f p(T) around the most probable transmissibility, (continuous black line) 
corresponding to each value of P lnv (L). The horizontal dashed line illustrates the case of estimations 
giving P[ nv (L) = 0.2. If the observed epidemic has a value of the transmissibility such as T\ that is to 
the left of the curve for P nv (line with circles), the estimated P lnv (L) overestimates P lnv . In contrast, 
for values of the transmissibility that are to the right of the curve for P nv (e.g. Th), the probability 
P\m{L) underestimates p nv . 

interval of time, t bs, and then apply the methods described above to calculate P; nv (L) assuming that T 
is unknown (as it occurs for real epidemics). 

For concreteness, we consider the arrangement shown in Figflja) and use methodology C for steps 
(i)-(iii). RF epidemics are observed during t Q b s = 7t with the aim of estimating the probability that 
they will invade a system of size L = 51. Note that over the time interval t < t ^ s = 7 the epidemic 
at most invades a hexagon of size L — 15. Then, the behaviour of the epidemic is extrapolated both in 
space and time. We proceed by, first, calculating the p.d.f. p(T) for the estimated transmissibilities T 
compatible with observation (spatio-temporal map). Fig. QJc) shows an example of p(T) obtained from 
the analysis of the evolution of an epidemic with T = 0.4. In general, the most probable estimate for the 
transmissibility, T*, corresponding to the maximum of p(T) for a single epidemic, differs from T but not 
significantly. In many cases, T lies within the 68% confidence interval for p(T) around its maximum (see a 
more detailed discussion in SI Appendix). The distribution p{T) allows the probability of invasion P nv (L) 
in the system of size L — 51 to be estimated using Eq. (JTJ. We have applied this prediction method to 
many (~ 10 4 ) spatio-temporal maps created with known transmissibility T spanning the interval [0, 1]. 
For each value of estimated P mv (L), the distribution p(T) is represented by a horizontal slice of the 
shaded area in Fig. [7] (see e.g. the slice along the dashed blue line corresponding to Pi nv (L) = 0.2 with 
darker colour corresponding to higher probability relative to the maximum of p(T)). The black ridge in 
the shaded area corresponds to the most probable transmissibility, T*, for each value of Pi nv (L). 

To test the quality of the predictions, the estimate of the probability of invasion P nv (T; L) is compared 
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with the probability P nv (P; L) that would be obtained if the exact value of T was known a priori (see 
line marked by circles in Fig. [7]). Making such a comparison we can see that for epidemics with low 
transmissibility where invasion is possible but not highly probable, Pi nv (L) overestimates Pi nv (T;L) for 
most of the possible values of T contributing to P lnv {L) (the shaded area including the ridge region 
corresponding to the typical values of T is mainly above the line marked by circles in Fig. [7]). This means 
that the estimations are biased upwards and the most likely is that the actual probability of invasion 
will be smaller than predicted. In other words, such predictions will typically give a safe bound for the 
probability of invasion. Obviously, there is a non-zero probability that the observed epidemic has a large 
value of the transmissibility (such as T n in Fig. [7]). In this case, p„ v (L) would underestimate the actual 
probability P inv . For more invasive epidemics (i.e. epidemics with P inv > 0.5), the shaded area is mainly 
below the line marked by the circles in Fig. [TJ meaning that the predicted probability of invasion P inv 
underestimates p nv for most of the possible values for T, including the most probable, In these 
situations however, both P nv and P nv are large and the predictions allow for a reasonable assessment 
for invasion to be done. In Sec. VI of SI Appendix we show mathematically that differences between 
Pnv(i-) and Pnv evaluated at the most probable transmissibility T* arc mainly dictated by the curvature 
of Pnv around T* and is intrinsically linked to the non-zero width of the p.d.f. p{T). This general result 
implies that the biases in the probability of invasion at low and high transmissibility are independent of 
the fitting method (Table [TJ) because all methods lead to a p.d.f. p(T) with non-zero width. 

The results presented above correspond to RF epidemics with homogeneous transmission. A similar 
approach has been used to deal with more realistic epidemics where the transmission of infection is 
heterogeneous due to variability in the infectivity and the susceptibility of hosts. As already mentioned 
in the previous section, the estimated p nv (L) for such epidemics usually gives a bound to the actual 
probability of invasion that is even safer than that obtained for cases with homogeneous transmissibility 
(see SI Appendix for more detail) . 

5 Discussion 

The methodology introduced here focuses on the prediction of relatively simple but important features 
of epidemics. This is in contrast to much previous work dealing with the prediction of quantitative 
properties of epidemics such as the detailed spatio-temporal evolution of the incidence. The advantage 
in dealing with simple characteristics of epidemics is that they can be more easily predicted in terms of 
simplified description of the spatio-temporal evolution. Our results demonstrate that, under quite general 
assumptions, it is possible to give reliable prediction of the final state of an epidemic with permanent 
immunisation from the early stage of its evolution. Such a prediction is possible even for a single realisation 
of an epidemic and thus the framework is relevant to inherently unique real-world epidemics. In fact, 
our approach can be applied for prediction of epidemics in real systems characterised by a wide range of 
space and time scales (e.g. crops) based on micro- or meso-cosm experiments of finite size and over finite 
time. 

The results obtained for experimental fungal invasion by using approximate methods (C-E in Table [TJ) 
are more robust than those based on supposedly more precise methodology F. This might be a consequence 
of the interplay between the high fitting precision for MCMC methods involving data augmentation with 
a poor description of the actual dynamics given by the continuous-time model fitted to the observations. 
A model capturing dynamical details at a level consistent with that offered by the fitting procedure 
might exhibit more predictive power than that presented here. This is an illustration of the importance 
of keeping all the steps involved in prediction at a similar level of complexity in order to give reliable 
predictions. For practical applications, the particular method to be used for obtaining the most reliable 
prediction depends on the problem in hand. The general rule that seems to emerge from our analysis 
is that a reasonable method should use as much informations as available from observations, and avoid 
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inferring data that is not directly available unless this is strictly required by the problem. This is the 
case for methods C-E in our particular study of fungal colony invasion in the population of agar-dots. 
Methods A and B use less information than available from observations (i.e. they use C(t) instead of 
F(l,t)) while method F infers information that is not available from observations. 

The proposed methods assume that epidemics can be approximately described by an effective trans- 
missibility that is constant over time and homogeneous in space. However, their applicability goes beyond 
epidemics with constant transmissibility. In particular, we have shown that the method gives reliable 
predictions in the presence of spatial heterogeneity in the transmission of infection. We expect that the 
methododology can also be successfully applied to cases where the transmissibility changes over time but 
it remains within the bounds for the effective transmissibility estimated from the early stage. 

We have characterised the final state of epidemics by the probability of invasion. This quantity is 
suitable for systems with well-defined boundaries such as, e.g. the population of agar-dots analysed above. 
In cases where hosts are placed on the nodes of more complex networks, the boundaries of the system 
are not necessarily well-defined |33j and it is more convenient to characterise the final state of epidemics 
in terms of the mean number of removed (i.e. ever infected) hosts, Ar. Our approach also applies to 
such complex networks. The formula given by Eq. ([1} provides an estimated size, N-r(L), if Pj nv (T; L) is 
replaced by the function JVr(T; L) giving the size of SIR epidemics with given transmissibility, T. 

Our methods have been applied under the assumption that the network of contacts between hosts 
remains unchanged during the course of epidemics. Such an approximation has been widely used in the 
past |33j and it is reasonable for cases in which the rate of change of the configuration of contacts is much 
smaller than the removal rate of hosts (i.e. ~ r _1 in our notation). This condition is clearly satisfied 
for the fungal colony invasion of the population of agar-dots considered here and also for many other 
epidemics associated with pathogens spreading in, e.g., networks of plants [22], farms [9], or airports |11) . 
This paradigm is also applicable to the spread of many infections in human populations (for instance, 
measles or SARS that have recovery periods of the order of few days). In contrast, the dynamics of 
contacts between humans plays an important role for other infectious diseases such as syphilis with a 
recovery period of 100 days [34] . A possible strategy to make predictions in this kind of networks would 
involve inferring the mixing parameter for contacts (as defined, for instance, in [34]) based on observations 
(step (i)) in a similar way as we estimated the parameters of the SIR model in step (iii). 

Another interesting task would be to extend the ideas presented here to deal with epidemics with 
persistence where immunity after recovery is not permanent (i.e. recovered hosts can be re- infected) . 
In this case, the simplest model for description of observations is the SIS model (susceptible-infected- 
susceptible) [35j and a possible quantity to be predicted would be the stationary prevalence of infection 
(i.e. the density of infected hosts in the stationary state reached after a transient [35). 

Due to stochasticity in transmission of infection, it is not possible to determine the parameters of 
a model describing an epidemic with absolute certainty even if the epidemic is observed during a long 
time £ bs before attempting inference. Furthermore, if it were possible to determine the exact value of 
the parameters, it would still be impossible to make arbitrarily precise predictions of the evolution of 
the epidemic in the future or predict with absolute certainty if the epidemic is going to be invasive or 
not (instead, one has to deal with the probability of invasion). The uncertainty in the prediction of 
the evolution of epidemics grows monotonically with the look-ahead time (see, e.g. the forecast of the 
incidence in Sec. IV of SI Appendix). There exists a prediction horizon beyond which the uncertainty 
of predictions of the evolution of the epidemic is too large for predictions to be useful. The location of 
the horizon is epidemic-dependent and also depends on how precise we want our predictions to be. In 
contrast, for a pure SIR epidemic, there is no prediction horizon for quantities such as Pi nv or Ar that 
only depend on the transmissibility. In other words, different replicates of epidemics with given T will 
follow different evolutions that have a prediction horizon but will lead to the same P lm or Ar [26 . 30^31] . 
More complex nonlinear dynamics for transmission associated with, for instance, a seasonal component in 
the transmission rate may lead to chaotic behaviour [36) . Predictability of catastrophic events in systems 
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exhibiting chaotic behaviour is a non-trivial question that has been widely studied in the past [37] and 
still receives considerable attention at present |38j . Even in the absence of stochasticity, the prediction 
horizon in these systems is intrinsically limited due to the high sensitivity of chaotic processes to the initial 
conditions and the values of the parameters. In addition, the accuracy of predictions does not necessarily 
increase monotonically with the observation time, i bs, before prediction [39 . In such situation, it would 
be necessary to estimate the value of i b s leading to the most reliable prediction. Due to all these factors, 
the methods proposed in this paper may not work when applied for prediction of catastrophic events 
in nonlinear dynamical systems. However, the ideas presented here together with approaches proposed 
for prediction in nonlinear dynamical systems may help in devising strategies for prediction in stochastic 
non-linear systems. 
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I. CONTINUOUS-TIME EPIDEMIOLOGICAL MODEL FOR STEP (II) 

In this section, we present the three-parameter continuous-time (CT) model used in methods E and 
F (Table 1, main text) to address step (ii) for prediction. In principle, this model is more realistic than 
the Reed-Frost (RF) model used in methods A-D and allows the existence of possible effects caused by 
the discrete-time character of the RF description to be explored. We use the fungal invasion experiment 
as a benchmark for the comparison between different models. In the CT, the spread (transmission) of 
the fungal colony between two neighbouring dots is treated as a time-inhomogeneous Poisson process [l| . 
The waiting time distribution f(t) for each transmission event can be modelled by a Weibull distribution 
multiplied by the transmissibility T: 

f k-i _(t\ k 

f(t) = T- r e-^J , (S.l) 

T S 

where tq is the characteristic time scale of the process, and A: is a shape parameter of the distribution. 
Given the cumulative probability function P(t) = f u=0 f(u)du, the survival function S(t) (giving the 
probability that transmission did not occur by time t) obeys the following relation: 

S(t) = 1 - P(t) = 1-T(l- e~fe) } , (S.2) 

The rate of the transmission process, (j>(t), is a function of the time since colonisation of the donor dot 
and is given by the expression: 

f(t) dln(5(t)) 

^ = W) = — d*— ■ 

In the limit k — > oo, the CT model reduces to the RF model, with the same value of T and infectious 
period r = tq. Indeed, for k — > oo, the survival probability given by Eq. (|S.2p becomes a step function, 



S(t) 



1, t < T 

1-T, t>T , 

which corresponds to a RF model, in which infection can be transmitted from an infected host to a 
susceptible neighbour with probability T only once the infectious period tq has passed. 



II. METHODS FOR FITTING MODELS TO DATA IN STEP (III) 

The aim of this section is to give a detailed description of the different methods used for fitting 
models M to data T> described in step (iii) of the methods for prediction proposed in the main text. 
The ideas presented in the main text can be framed within a Bayesian approach which assumes that 
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the parameters describing A4 are random variables. The aim of step (hi) is to evaluate tt(9\T>), the 
probability density that A4 with parameters describes the data. According to Bayes' rule: 

tt(0|X>) cx P(2?|0)7r(0), (S.3) 

where vr(0) is the prior distribution of the parameters, reflecting our initial belief in their values, and 
P(£>|0) is the likelihood (the probability of the data given the parameters). 

Several challenges arise when using this Bayesian approach in the analysis of epidemic spread. The 
first difficulty is associated with inherent limitations in the observations. A complete spatio-temporal 
data set would contain the precise time of colonisation tj of each host j in the population, i.e. T> = {tj}. 
Unfortunately, it is often the case that observations do not provide such detailed information. In the 
particular experimental data set considered in the main text, the status of each dot is only recorded at 
discrete (1-day) time intervals. Hence, the actual dataset is T> = {dj}, where dj is the day when dot 
j was first observed as colonised. We have explored two possible ways to deal with the lack of precise 
information. The option used in methods A-E (Table 1, main text) involves identifying descriptors for 
the evolution of the epidemic that are suitable for the prediction of the catastrophic event. The lower- 
dimensional descriptors used in this work are the cumulative incidence, C(t), and the shell-evolution 
function, F(l,t). Another option is to use data augmentation that treats the unobserved colonisation 
times as parameters to be estimated. This is the procedure followed in method F. 

Another source of difficulties is due to the fact that the analytical calculation of the posterior tt(9\T>) 
is, in general, impossible [3]. Therefore, it is common to resort to numerical methods to sample from 
tt(G[D). In order to do this, we propose a new approximate method (denoted as MD), which is based 
on the calculation of the minimal distance between two datasets and does not require knowledge of the 
likelihood P(X>|0). In addition, we have used two known methods: 

Pi 

• a method belonging to the class of Approximate Bayesian Computation (ABC) [J], which calcu- 
lates an approximate posterior, and which shares several basic features with the MD method. 

• Markov-chain Monte Carlo (MCMC) with data augmentation P, Q], which relies on the exact 
analytical form of P(P|0) to sample from the exact posterior 7r(0|D). 

The ABC method was used with multiple purposes: 

(a) to test the new MD method (involving the minimisation of a given distance between observed and 
simulated data) against an already-known method (ABC, which involves a cutoff on the same 
distance). 
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(b) to test different choices in step (ii), by comparing estimations obtained using the RF modei with 
estimations obtained by means of the CT model. 

The MCMC method was employed to test the new MD method against a widely-used technique that 
uses both a different level of description (site-level vs. shell or MF level) and a different model M (CT 
dynamics instead of RF dynamics) . 

A. Minimum distance (MD) method 

Let Dobs be the observed data, 9 a set of candidate parameters, and T> s [ m a simulated dataset 
generated using 9. Then, if D bs = Adm, the vector of parameters, 9, is drawn from 7r(0|2? o b s )- In 
practice, obtaining an exact match between observed and simulated dataset is often computationally 
unfeasible, and one has to resort to an approximate match. To this end, we define a metric d 2 (Vi,V2) 
that measures the distance between two datasets T>\ and T>i . The aim of the MD method is to calculate 
the distribution of parameters 9 that minimise d 2 (T>i,T>2) and gives an approximate posterior. The 
choice of the metric d 2 (V s - im , fobs) is problem-specific and, in general, not unique. We have tested two 
different metrics, corresponding to different descriptors of the data (cf. step (i) of the main text): 

1. For the shell-based description, we used d 2 = J2i t (-^sim('j t) — F Q ^ s (l, t)) 2 , where / enumerates 
the shells, t is the discretised time (observation times in days), and F is the shell-evolution 
function. 

2. Another option is to ignore any spatial information from the data (mean- field (MF) description) 
and consider only the total fraction of colonised sites at time t, c(t) = C(t)/N. In this case, we 
chose the distance function d 2 = J2t ( c sim(t) — c bs(*)) 2 > where t is again the discretised time. 

The algorithm to implement the MD method considers two indexes, n, counting the parameter 
vectors resulting from the minimisation procedure, and r, counting the iterations. The values of n and 
r are in the range n, r £ N with maximum values n max and R, respectively, and proceeds as follows: 

MD.l Set n = 
MD.2 Setr = 

MD.3 Chose a value 9^ for the parameter vector sampled from the prior n(9). 

(r) (r) 

MD.4 Generate a data set from the model M with parameters 9 K n ' . 
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MD.5 Calculate d 2 (V ohs , V^) and 

— If r < R, set r = r + 1 and return to MD.3 or 

- If r = R, go to MD.6 

MD.6 Among all the parameters {0^; r = 0, 1, . . . , R}, chose the set of parameters 6 n € {0^} giving 
the closest simulated data, T> s - im to observations, i.e. d 2 (P b s , ^sim) = min r =o,...,i?{^ 2 (£ ) obs> ^sim)}- 

MD.7 Set n = n + 1 and return to MD.2 until n = n max . 

The result of this algorithm is a set of parameter vectors {6 n ; n = 0, 1, . . . , n max } that give simulated 
data with minimum distance to observations. The normalised histogram for the obtained parameter 
vectors {0 n } defines a p.d.f. p(6) that approximates the posterior ir(6\T>). 

For all the results presented in the paper and obtained with the MD algorithm, we set R = 5000. 
For some fungal epidemics, we have checked that larger values of R do not lead to smaller values of 
d 2 {V ohs , V sim ) in step [MD.6]. 

B. Approximate Bayesian Computation (ABC) 

In common with the MD method, the ABC approximate Bayesian method we have used also relies 
on the definition of a metric, d 2 . If the distance between observed and simulated datasets is less than 
a given tolerance parameter, e, i.e. d 2 (V b s , V s i m ) < e, then 6 is drawn from the approximate posterior 
7r (0|d 2 (P o f, s , T> s i m ) < e). The accuracy of the approximation increases as e — > 0. 

For our estimations, we used a Markov-chain Monte Carlo algorithm that can be summarised as 
follows: 

ABC.l Set n = and choose the initial value 6q of the parameter vector. 
ABC. 2 Generate a candidate vector, 0', from a proposal distribution q(6'\6 n ). 
ABC. 3 Generate a data set V s i m from the model M. with parameters 6'. 
ABC.4 Calculate d 2 (V ohs , V sim ) and 

- if d 2 (V ohs , V sim ) < e, go to ABC. 5; 

- if d 2 (D ohs , D sim ) > e, set n+ i = n and go to ABC. 7. 
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Rep. 


D 


B 


E 


1 


0.03 


1.5 


0.5 


2 


0.03 


1.5 


0.7 


3 


0.03 


1.5 


1.5 


4 


0.03 


1.5 


0.5 


5 


0.03 


1.5 


1 


6 


0.03 


1.5 


1 



TABLE I. Values of e used for methods B, D, and E (Table 1, main text) using the ABC inference 
method in step (hi) for prediction. 



ABC. 5 Calculate the probability of acceptance: 



Pa 



min ( 1, 



7r(e') q (e n \e') 



ir(O n )q(O'\0 r 

ABC. 6 Set n+ \ = 6' with probability p acc , or 6 n+ \ = n with probability 1 — p acc . 

ABC. 7 Set n = n + 1 and return to ABC. 2 until the chain has converged and the required number of 
samples has been collected. 

In our estimations, either a uniform distribution with the same support as the prior (see below) 
or normal distribution, A/"(0, a 2 ) were used for the proposal distribution q(.) (cf. steps ABC. 2 and 
ABC. 5). The criterion for the choice between these two distributions was to minimize the average 
number of simulations needed to generate a dataset with d 2 (D ^ s , D s im) < e in step [ABC. 4]. The value 
of a for the normal distribution was chosen according to the same criterion, and was typically a fraction 
between 0.05 and 0.1 of the support of the prior distribution. 

Every chain was run for 5 x 10 4 steps, discarding an initial burn-in period of 5 x 10 3 steps. Since 
very small values of the tolerance e imply very low acceptance rates, the final choice of e was the result 
of a tradeoff between the accuracy of the approximation and CPU time available. The values of e that 
were finally used in our analyses are shown in Table SH 

For both MD and ABC approaches, we assumed independent priors for all the parameters: it(0) = 
7r(T)7r(r e xp) (RF model) and it(0) = 7r(T)7r(ro)7r(A;) (CT model). For the RF model, all the priors 
were uniform: vr(T) = U(0, 1) and 7r(r ex p) = ^(^min, 7" max ), where r m \ n = Id and r max was changed 
between treatments (increasing with the lattice spacing, from T max = 6d for the lattice with a = 8mm 
spacing to r max = 12d for the lattice with a = 16mm spacing). For the CT model, we used two sets 
of priors: (i) noninformative uniform priors for all the parameters (vr(T) = U(0, 1), vr(ro) = U(0, 20), 
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n(k) = U(0, 20)) and (ii) noninformative prior for the transmissibility (vr(T) = U(0, 1)) and exponential 
priors for the other two parameters (tt(tq) = Exp(l), ir(k) = Exp(l) as for the augmented-data MCMC 
estimations). Only results for noninformative priors from set (i) are presented below, and compared 
with results from the RF model. The different choice for the priors from set (ii) has an effect on the 
posterior distributions, but does not affect significantly the predicted incidence curves (see Section 3) 
and, for the sake of brevity, results from set (ii) are not presented. 

The ABC and MD methods share several common features. Both of them rely on the simulation 
of epidemics using the current parameters, and on the calculation of a distance between simulated 
epidemics and the observed data. However, the ABC method considers any distance below the cutoff 
e, while the MD method seeks to minimise the distance over a given number of simulations R. The 
iterative procedure used in MD method allows the use of an additional parameter analogous to e in 
ABC method to be avoided. In both cases, the exact posterior is recovered in the proper limit (e — > 
and R — > oo, respectively). 

C. Markov Chain Monte-Carlo (MCMC) method with data augmentation 

In order to implement the data-augmented MCMC method, it is necessary to calculate the explicit 
form of the likelihood P(Z?|0) (with = (T, To, k)). We sketch here the main steps of the calculation. 
Let X be the set of the dots that are colonised before the end of the experiment (at time i e nd = 21 days), 
and U be the set of those that are still uncolonised at t = i e nd- Assume first that we know the times of 
colonisation tj of each dot j £ I. The data then consist of the vector t of colonisation times, plus the set 
of uncolonised dots, i.e. T> = (t, U). The nearest neighbours of dot j form the set A/}, and the potential 
donors of j form the subset Sj C A/j. If j £ U, then Sj contains the colonised neighbouring dots, i.e. 
Sj = {i : i £ A/} fll}. If in contrast j £ I, Sj contains the neighbouring dots that are colonised before 
j, i.e. Sj = {i : i £ Mj fll, ij < tj, }. 

Given these definitions, the likelihood function can be written as the product of the contributions 
from individual dots j: 

FCD\0) = Hff(t J )Hv^(t cnd ), (S.4) 

jei j&A 

where ff(tj), is the p.d.f. for the colonisation times tj, and V^(t en d) is the probability for dot j to be 
uncolonised by the end of the experiment. These contributions can be calculated explicitly as follows. 
The probability that a dot j € X has not been colonised by a given neighbour i 6 Sj by time tj is 
given by the survival function S(tj —t{) (see Eq. (|S.2p ). Hence, the probability Vj(tj) that dot j is still 
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uncolonised at time tj is given by the product over all i G 5ji 

= n s ^ - *o = n ex p (- r tl ^(*) dt ) = ex p ( - e r u dt ) . ( s - 5 ) 

where we used the relation S(t) = exp(— J Q * cj)(u)du). The p.d.f. for tj is then given by: 

f * {tj) = — m~ = E - **) ex p - E y o ^(*) dt . ( s - 6 ) 

\ie Sj J \ ie Sj j 

Likewise, a dot j G W is still uncolonised by time t en d when transmission did not occur from any of its 
neighbours % £ Sj, yielding the probability: 



if (t 



end ) 



II ^(W - t<) , if Sj£ , 
ie 5 i (S.7) 

1 , if 5,- = . 



In general, we are interested in obtaining the marginal distribution of a single parameter (in par- 
ticular, T) from Eq. (|S.3j) . and thus all other parameters entering the expression for P(D\9) have to 
be integrated out. This is, in general, unfeasible analytically. In the fungal invasion experiment, the 
calculation is further complicated by censoring, i.e., by the fact that experimental observations are made 
at discrete (1-day) time intervals. As a consequence, if dj is the day when dot j was first recorded as 
colonised, then the colonisation time tj is constrained to lie in the interval tj £ (dj — 1, dj), but its exact 
value is unknown. The actual dataset is hence T> = (d, U), and the likelihood has to be calculated from 
Eqs. (|S.4HS.7p by integrating out the unobserved colonisation times, 



P(d, U\0) = [ P(t, U\6)dt , 
JT(d) 



7(d) 

where the integral is performed over the (high-dimensional and, in general, very complex) space T(d) 
compatible with the observed data. 

Since the high-dimensional integrals introduced above are analytically intractable, numerical tech- 
niques are commonly used to sample values from the posterior distribution tt(6\T>). The MCMC method 
consists in implementing a Markov chain for that has tt(0\D) as stationary distribution. A large lit- 
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to which the reader is referred for details. In our case, 



erature exists on this subject (see e.g. Ref. 
a Metropolis-Hastings algorithm has been used to build the Markov chain. 
The algorithm can be summarised as follows: 

MCMC.l Set n = and choose the initial value Oq of the parameter vector. 

MCMC. 2 Generate a candidate value of the vector 6' from a proposal distribution q(0 \0 n ). 
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MCMC.3 Calculate the probability of acceptance: 

. / 7r(g'|P) \ . / P(V\0'M0')q(0 n \0') 
v acc = mm 1, — ; — : — - = mm 1, — ; — ; — - — ; — - — — r . — - 

MCMC.4 Set 6 n+ \ = 0' with probability p a cc 5 or ^n+i = 6n with probability 1 — p acc - 

MCMC.5 Set n = n + 1 and return to MCMC.2 until the chain has converged and the required number 
of samples has been collected. 

In order to deal with the unobserved colonisation times t, we used data augmentation [2]. Their values 
were treated as parameters to estimate, i.e. the original parameter vector 9 was expanded (augmented) 
to (0, t). 

New colonisation times were then proposed and accepted/rejected within the same Metropolis- 
Hastings algorithm. Additional care had to be taken in this case, since at each step a pathway of 
transmission must exist between the dot inoculated at time t = and all the other colonised dots in the 

rt n 

system (see the discussion in [2j and 6(]). 

Independent priors were used for all the parameters, so that vr(0) = 7r(T)7r(ro)7r(fe). A noninformative 
uniform prior was used for the transmissibility, i.e. vr(T) = U(0, 1). For the other two parameters, 
exponentially distributed priors were used, i.e. vr(ro) = Exp(l), ir(k) = Exp(l). Such choice was made 
in order to exploit our prior knowledge, i.e., respectively, that the typical time scale of the fungal spread 
in our system is of the order of days, and that the waiting-time distribution is not step-like (i.e., k 
is not too large; note that such assumption is opposite to that of the RF model). We checked that 
the posterior distribution was not too sensitive to changes up to a factor 2 in the parameters of the 
exponential priors. A uniform prior was used for each augmented colonisation time. 

Every chain was run for 10 5 MCMC steps discarding an initial burn-in period of 10 3 steps. We 
checked that the final posterior distribution was robust with respect to the choice of the inital point 6q. 

D. Distribution for the estimated transmissibility, p(T) 

This section complements the results presented in the main text (Figs. 2 and 3) for the p.d.f. p(T) 
obtained with different methods for parameter estimation in the fungal invasion experiment. Within 
the Bayesian framework, the analogous of the distribution p(T) introduced in the main text for the MD 
method is the marginal p.d.f of the posterior ir(0\T>) integrated over the variable r exp for the RF model 
and over the variables to and k for the CT model. 

Fig. S|T] shows the comparison of the probability density functions p(T) obtained for the agar dot 
experiment by all the methods summarised in Table 1 of the main text. The posteriors are shown for 
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Fig. SI. Comparison of marginal posterior distributions obtained with methods listed in Table 
1 of the main text. Different symbols and lines correspond to different methods, as indicated by the 
legend. All the replicates correspond to invasion in populations with lattice spacing a = 10 mm. 



experiments in populations with lattice spacing a = 10mm. This particular set of results was chosen 
because it summarised well all the main effects of the methodologies used. All the estimations correspond 
to observations over the complete duration of the experiment, i.e. t < i bs = 21 days. 
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III. ADDITIONAL COMPARISONS OF FITTED MODELS WITH 

EXPERIMENTAL DATA 

In the main text, we presented a test of the goodness of fit of models to data based on the mean 
squared distances d\ and dj. The purpose of this section is to give a more visual comparison between 
fitted models and observations based on the cumulative incidence. 

For every model used, we obtain the statistics for the incidence corresponding to fitted models by 
sampling the parameters from the joint (exact or approximate) posterior, ir(0\T)). This procedure 
gives a p.d.f. p(C\t) for the incidence C at any given time t. The dispersion of p(C\t) is associated both 
with the stochasticity in the simulated model for each value of the parameters, 6, and the dispersion for 
the values of these parameters given by n(0\T>). Strictly speaking, the comparison of the experimental 
incidence with that obtained by methods A-D based on the RF model makes sense only when the 
stochastic nature of the process is taken into account. Indeed, the fine details of the two types of 
processes are different: while the experimental curve C(t) corresponds to a discrete sampling of a 
continuous-time process (discrete set of observations), p(C\t) gives the statistics of effective discrete- 
time processes with random values of transmissibility corresponding to 6 being drawn from ir(0\D). 

Figs. S21 and show the comparison of the experimental incidence for systems with lattice spacing 
a = 10 mm with the estimations p(C\t) obtained from observations during time t < t Q bs = 21 days. 
Comparisons are presented for methodologies B, C, E, and F. Results for methodology A (based on the 
MD method in step (iii) for prediction) are comparable to those obtained by methodology B (based on 
the ABC method) and have not been plotted in Figs. S2] and [3] for clarity. Similarly, the comparison 
for methodology D (based on the ABC inference method) is similar to that plotted for methodology C 
(based on the MD method) (not shown in Figs. S2]and[3j). These results imply that, given a level of 
description of data (step (i)) and model in step (ii), results are quite independent of whether MD or 
ABC approaches are used to address step (iii). 

As can be seen from the figures, the p.d.f p(C\t) obtained by methods C, D, and E give a better 
description of the observed data for most of the replicates. On the other hand, the p.d.f. obtained 
by methodology F are often not able to capture the global trend of the incidence (see, e.g., replicate 
3). This suggests that the focus of methodology F on finer details of the evolution may prevent the 
predictability of the invasive properties of the system. As discussed in the main text, this might be due 
to the negative interplay between the simplicity of the CT model and the individual-based description 
of the data-augmented MCMC method for inference. 
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method B method C method E method F 

replicate 1 




10 20 10 20 10 20 10 20 



t (days) t (days) t (days) t (days) 

Fig. S2. Incidence (line with circles) for fungal invasion of a set of agar dots arranged on a triangular 
lattice with spacing a = 10 mm [3]. Replicates 1-3 are shown. For each replicate, four panels are shown, 
with the p.d.f. p(C\t) for the incidence C at any time t obtained by means of methods B, C, E, and F 
(as described in Table 1 of the main text). In all the panels, the ridge (bold solid line) corresponds to 
the median of p(C\t). The grey-scale shaded areas are the 20% (darker), 40%, 60%, and 80% (lighter) 
percentiles around the median. 

IV. FORECAST OF THE INCIDENCE 

In the main text, we quantified the differences between predictions and observations for all methods 
and fungal invasion experiments in terms of the quantities Ac and AF (cf. Fig. 6 of the main text). 
In this section, we give a more visual comparison between observations and predictions based on the 
temporal incidence, C(t). 
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method B method C method E method F 

replicate 4 




t (days) t (days) t (days) t (days) 



Fig. S3. Same as in Fig. S21 for replicates 4-6. 



Figs. and S2] show the comparison for the six replicates available in the experiments with a = 
10 mm and a = 12 mm, respectively. Predictions of the incidence between days 11 and 21 are based on 
estimates of the transmission parameters made during the first 10 days with methodology C. As can be 
seen, the estimated incidence provides a reasonable statistical description for the observed incidence for 
all the replicates. The quantity Ac quantifies the rms distance between the observed incidence and the 
predicted incidence obeying the p.d.f p(C\t) (given by the grey-scale shaded area in Figs. S3] and S2J). 
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Fig. S4. Forecast of the incidence for fungal invasion in a lattice of agar dots with a = 10 mm. 
Each panel corresponds to a different experimental replicate of the epidemic. The grey-scaled shaded 
area shows the p.d.f. p(C\t) for the numerically extrapolated incidence based on observations over time 
t < ^obs = 10 days, as marked by arrows. The grey-scale shaded areas are the 20% (darker), 40%, 60%, 
and 80% (lighter) percentiles around the median (bold solid line). 

V. ADDITIONAL RESULTS FOR THE NUMERICAL EXPERIMENTS 

A. Homogeneous transmission of infection 

Here, we give numerical support to the claim made in the main text that the most probable estimate 
for transmissibility, T*, corresponding to the maximum of the probability density function (p.d.f.) p(T), 
does not differ significantly from the actual transmissibility, T. This is shown in Fig. [6] where the 
estimate for the transmissibility, T, is plotted as a function of the transmissibility T. The estimates 
have been obtained for many different SIR numerical epidemics (~ 10 4 ) with T € [0, 1]. The analysis has 
been restricted to epidemics with final size N-& (i.e. the number of removed hosts) greater than a certain 
cut-off, No, in order to avoid estimates for small epidemics giving a poor estimator for transmissibility. 
Excluding small epidemics from the analysis also makes sense from a practical point of view because 
they are not a threat in terms of invasion. We have checked that the statistics for T do not depend 
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Fig. S5. Similar representation as in Fig. S3] for fungal epidemics in populations of agar dots with 
lattice spacing a = 12 mm. 

significantly on Nq for No > 5. For each epidemic i with given T, we obtain the p.d.f. pi(T\T) for 
the effective transmissibility, T, from observations of the initial stage (for t < i b s = 7r, where r is the 
infectious period for infected hosts which is taken as the unit of time, r = 1). Then, the mean p.d.f., 
(p(T\T)) e , is calculated by averaging pi(T\T) over N e (T) different stochastic realisations of epidemics 
with given value of T: 



(p(T\T)) e 



1 



N C (T) 



N e (T) f , 

The first moment of (p(T\T)) e gives an estimate for the mean (T*) e averaged over stochastic realisations. 
The dependence of (T*) e on T is shown by the continuous line in Fig. SSI The dispersion of {p(T\T)) c , 
shown by the shaded region in Fig. S£]as a function of T, contains contributions from both the width of 
each individual distribution, pi(T\T), and dispersion of the maxima for different replicates. In particular, 
the standard deviation, cr(T), of {p(T\T)) e is given by 

*(T)=[tf) c + *l] 1/2 , 

where 



N e (T) 



N e (T) 

E 

i=l 



T 2 Pi (T\T)dT 



1 f Pi (f\T)df\ 
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is the average over stochastic realisations of the variance of of pi(T\T). The quantity is the variance 
of Tic i over stochastic realisations, calculated as 

a * ~ N C (T) ^ N C (T) J • 

As can be seen from Fig. 93 the actual value for the transmissibility is statistically well described by 
the distribution (p(T\T)) e . The mean for the most probable estimate for the transmissibility, (7*} e , is in 
good agreement with the actual transmissibility. In particular, (T*) c provides an excellent estimate for 
T in the most interesting situations with T > 0.3 where invasion is more likely. The deviations of (T*) e 
from T are larger for small values of T because epidemics are typically small and the deviations are 
large. However, it is important to note that (T*) e overestimates T in these situations and thus provides 
a safe bound for invasion. 




Fig. S6. Estimates of the transmissibility for numerical SIR epidemics with homogeneous transmissibil- 
ity. The shaded yellow-brown region shows the levels of confidence as percentage of the p.d.f. (p(T\T)) c 
around the most probable mean transmissibility, (T*) e (continuous line), corresponding to each value 
of T. The dashed line representing the ideal situation (i.e. exact prediction) with T = T is given for 
comparison with the actual prediction shown by the continuous line. 
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B. Heterogeneous transmission of infection 

In the main text, we have analysed the epidemics in model systems with homogeneous transmissi- 
bility for all pairs of connected hosts. Realistic populations of hosts exhibit inherent heterogeneity in 
transmissibility and it is crucial to understand its effect on the prediction method introduced in the 
main text. 

In order to study the predictability of invasion for epidemics with heterogeneity in transmission 
of infection, we consider a simple but generic situation in which transmission is heterogeneous due 
to variability in the infectivity, X, and susceptibility, S, of hosts. As a first approximation, the rate of 
infection from an infected donor host with infectivity X^ to a susceptible recipient host with susceptibility 
S r is defined as /3d-r = X(±S T [8j. We assume that Id and S T are independent random variables distributed 
according to truncated normal distributions, Af(T,aj) and Af(S, <r|), respectively, which are the same 
for each host [J]. The mean values, X and S, provide an effective measure of the mean strength of the 
transmissibility while the standard deviations, ax and as, characterize the degree of heterogeneity. The 
multiplicative form of the infection transmission rate = XdS r brings correlations in transmissibilities, 
T&_ T = 1 — e T ^ d - r [10]. Indeed, all the transmissibilities from a donor are affected by the value of 
and thus they are not independent. Similarly, all the transmissibilities to a recipient are influenced 
by its susceptibility S r and thus also correlated. Such correlations make the invasion probability for 
heterogeneous system to be dependent on the whole set of transmissibilities {Td_ r }. In spite of that, the 
method based on a single effective transmissibility, T, and Eq. [1] in the main text is still applicable 
and useful. In realistic situations, the transmissibility is often assumed to be homogeneous because the 
precise degree of heterogeneity in transmission of infection is unknown and difficult to infer in detail. 

In order to test our methodology in heterogeneous systems with different strengths of transmissibility, 
we perform numerical experiments for epidemics with ag = ax = 0.2, mean infectivity set to X = 0.4 
and variable S. Again, observations are made over an initial interval of time t < t b s = 7r for estimation 
of the effective transmissibility, T. Similarly to the analysis given in the previous section for epidemics 
with homogeneous transmission, Fig. S21 shows a comparison between the p.d.f. {p(T\{T))) e averaged 
over stochastic realisations of epidemics (shaded region) with the ideal situation giving exact prediction 
of the spatially averaged transmissibility used in the simulations, i.e. T = (T) (dashed line). As can 
be seen, the estimates are statistically consistent with (T). In fact, the mean of the most probable 
transmissibility, (T*) e , gives a good description for (T) (compare the continuous and dashed lines in 
Fig. S[7J). We proceed further as in the case with homogeneous transmission by calculating P; nv (L) for 



18 




Fig. S7. Estimates for the effective transmissibility in numerical SIR epidemics with heterogeneous 
transmissibility introduced by Gaussian randomness in infectivity X and susceptibility S (with ax = 
as = 0.2, mean infectivity set to X = 0.4, and variable mean susceptibility 0.1 < S < 3.5). The shaded 
yellow-brown region shows the levels of confidence as percentage of the p.d.f. (p(T\(T))) c around the 
most probable mean transmissibility, (T+) e (continuous line). The dashed line representing the ideal 
situation (i.e. exact prediction) with T = (T) is given for comparison with actual prediction shown by 
the continuous line. 

each of the numerical epidemics in a system of size L = 51 (see Fig. 2(a) in the main text) by using 
Eq. [1] in the main text, the estimated p(T), and the probability of invasion Phom (T', L) for systems 
with homogeneous transmissibility equal to T. Note that here Ph_ om (T; L) corresponds to the function 
denoted as P{ nv (T;L) in the main text. The notations have been changed in order to distinguish 
between the probability of invasion in homogeneous systems and the probability of invasion in the 
presence of heterogeneity, denoted as Phet(X; L). The results of our estimations are shown in Fig. 33 
The relation between P; nv (L) and Ph om (T; L) is very similar to the relation reported in the main text 
for epidemics with homogeneous transmission. Indeed, for epidemics with low transmissibility, P{ nv (L) 
typically overestimates Pi nv . In contrast, for more invasive epidemics, Pi nv (L) underestimates Pi nv for 
most of the possible effective transmissibilities. Although this comparison has some interest, in the 
current situation it makes more sense to compare the estimated probability of invasion with the actual 
probability of invasion in heterogeneous system, Phet(X;-L), that can be calculated numerically and 
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Transmissibility 



Fig. S8. Numerical experiments of SIR epidemics with heterogeneous transmissibility induced by 
Gaussian randomness in infectivity X and susceptibility S (with erj = as = 0.2, mean infectivity 
set to I = 0.4, and variable mean susceptibility 0.1 < S < 3.5). The lines marked by circles and 
squares correspond to the probabilities of invasion Ph om and Phet plotted vs (T) for homogeneous and 
heterogeneous systems of size L = 51, respectively,. Estimates for the probability of invasion, P nv (L), 
have been evaluated for each epidemic (out of ~ 10 4 ) using the p.d.f. p(T) obtained by observing the 
initial evolution during time t < t ^ s = 7t and then fitting the observed spatio-temporal map by the 
shell-evolution function. Horizontal slices of the yellow-brown shaded area corresponding to a fixed 
value of P nv (L) represent the distribution p(T) averaged over realisations of epidemics with the same 
value of Pi nv (L). The points on the ridge (black solid curve) of the shaded area correspond to the most 
probable transmissibility, T*, averaged over epidemics with a certain value of P; nv (L). 

shown by the line with squares in Fig. S51 The comparison of this line with the shaded region reveals 
that the estimated P nv (L) overestimates the actual probability of invasion in most of the situations 
both for cases with low and high transmissibility. Therefore, the estimates of P nv typically provide 
safe bounds for the probability of invasion. In fact, the larger the heterogeneity in susceptibility and/or 
infectivity, the safer the bound is. This is a consequence of the inequality P nv ((T)) > P^ et ((T)) that 
holds for any given value of (T) under quite general conditions due to the existence of correlations in 



transmission induced by heterogeneity in the transmission rates [8|, |llj. Indeed, Fig. £[H shows that the 
inequality holds for the numerical experiments considered here with (T) = T (i.e. the line corresponding 
to the heterogeneous case marked by the squares is below the line marked by the circles for homogeneous 
system) . These results are particularly encouraging for analysis of realistic epidemics in which a certain 
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degree of heterogeneity in susceptibility and infectivity of hosts is expected to be ubiquitous. 

VI. DIFFERENCES BETWEEN P inv AND P inv 

In this section, we discuss the origin of the difference between the estimated probability of invasion, 
Pi nv (L), evaluated at the most probable transmissibility, T*, and the actual probability of invasion 
Pinv{T;L) one would obtain if the transmissibility of the epidemic was known exactly. 

To start with, recall that the relation between Pj nv (L) and P im (T\ L) is given by Eq. [1] in the main 
text. As we have seen, p{T) is a peak-shaped function approximately symmetric about the peak position 
at T*. Therefore, the peak region will mainly contribute to the integral in Eq. [1] of the main text. If 
is not too close to the inflection point of P mY (T\ L), the Taylor series expansion of Pi nv (T; L) in (T — T+) 
to second order, i.e., 

P inv (T; L) = P inv (t; L) + P( nv (t; L)(f - t) + \pL(t; L)(T - t) 2 , (S.10) 

is sufficient to estimate the deviation of Pi nv from Pi nv - Indeed, substitution of Eq. (|S.10p into Eq. [1] 
of the main text gives: 

Pinv(L) = P inv (t; L) + P( nv (t; L) f p(f)(f - T*)dT + ^ V (T„; L) I p(f)(f - t) 2 df (S.ll) 

Jo 1 Jo 

The distribution of T is approximately symmetric around the maximum, i.e. p(T — T±) ~ p(T± — T) 
(see Fig. 2(b) in the main text), and thus the term containing P( nv in Eq. (jS.lip is negligible in comparison 
with other terms in the sum. This means that 

P inv (L) > P hlv (t; L) if P(^ v (t; L) > 
PinAL) < PmAt; L) if ^(f*; L) < , (S.12) 

where we have taken into account that the last integral in Eq. (jS.lip is positive. The above inequalities 
demonstrate that the value and sign of Pi nv (L) — Pi nv (T*;L) depends on the curvature of Pi nv (T;L) 
around T = T* which is given by P(^ Y - 
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